Building Databases

Author

James Van Slyke

Scientific Notation

When you first set up R and RStudio, R may display large numbers in scientific notation, so you’ll see numbers like this \[ 5.234e+10 \] The part after the e is an exponent: e+10 means “times 10 to the 10th power”. Here’s a simpler example to understand what this means. Let’s start with a number like 28.

In scientific notation this would look like \[ 2.8e+01 \] or, in a more familiar exponential form, \[ 2.8 \times 10^1 \] So it’s 2.8 times 10 to the first power. 280 would look like this \[ 2.8e+02 \ \text{or} \ 2.8 \times 10^2 \] It’s basically a more compact way to represent large numbers like 280 million (280,000,000) \[ 2.8e+08 \ \text{or} \ 2.8 \times 10^8 \] To turn this setting off, do the following

options(scipen = 999)

If you want to turn it back on, do this

options(scipen = 0)
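If you want to see the setting in action, here is a small sketch using base R’s format() function, which returns the text R would display for a number. The object names x and old are just made up for this illustration.

```r
# Start from R's default setting and remember the previous options
old <- options(scipen = 0)

# Force scientific notation for a small number
format(28, scientific = TRUE)   # "2.8e+01"

x <- 52340000000                # the number from the first example
format(x)                       # "5.234e+10" with the default setting

# Turn scientific notation off and look at the same number again
options(scipen = 999)
format(x)                       # "52340000000"

# Put the options back the way they were
options(old)
```

Saving the result of options() in old is optional, but it makes it easy to restore whatever setting you had before.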

More work on databases

Let’s build a small database from COVID-19 figures reported by the New York Times.

  1. Create your objects
  2. Make sure to put quotation marks around values that are names or titles (remember, these are categorical variables)
Countries <- c("United States", "India", "Brazil", "Russia", "UK")
Total_Cases  <- c(24249722, 10581837, 8511770, 3574330, 3466849)
Total_Deaths <- c(400810, 152556, 210299, 65632, 91470)

Then you can use the data.frame command to put them all together

Covid <- data.frame(Countries, Total_Cases, Total_Deaths)
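Once the data frame exists, it is worth checking that it looks the way you expect. The sketch below repeats the objects from above so it runs on its own, then uses a few base R functions (str, dim, and the $ operator) to inspect the result.

```r
# Same objects as above, repeated so this block runs by itself
Countries    <- c("United States", "India", "Brazil", "Russia", "UK")
Total_Cases  <- c(24249722, 10581837, 8511770, 3574330, 3466849)
Total_Deaths <- c(400810, 152556, 210299, 65632, 91470)

Covid <- data.frame(Countries, Total_Cases, Total_Deaths)

str(Covid)    # shows each variable, its type, and the first few values
dim(Covid)    # 5 3 (five rows, three columns)
names(Covid)  # "Countries" "Total_Cases" "Total_Deaths"

# Use $ to pull out one variable, and [ ] to pick the rows you want
Covid$Total_Cases[Covid$Countries == "India"]   # 10581837
```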

You could actually do all these steps at the same time

Covid_Again <- data.frame(
  Countries    = c("United States", "India", "Brazil", "Russia", "UK"),
  Total_Cases  = c(24249722, 10581837, 8511770, 3574330, 3466849),
  Total_Deaths = c(400810, 152556, 210299, 65632, 91470)
)
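To convince yourself that the two approaches give the same result, you can build the data frame both ways and compare them with base R’s all.equal(). This is only a quick check, with the data repeated so the block runs on its own.

```r
# Way 1: create the objects first, then combine them
Countries    <- c("United States", "India", "Brazil", "Russia", "UK")
Total_Cases  <- c(24249722, 10581837, 8511770, 3574330, 3466849)
Total_Deaths <- c(400810, 152556, 210299, 65632, 91470)
Covid <- data.frame(Countries, Total_Cases, Total_Deaths)

# Way 2: do everything inside data.frame()
Covid_Again <- data.frame(
  Countries    = c("United States", "India", "Brazil", "Russia", "UK"),
  Total_Cases  = c(24249722, 10581837, 8511770, 3574330, 3466849),
  Total_Deaths = c(400810, 152556, 210299, 65632, 91470)
)

# TRUE means the two data frames match
all.equal(Covid, Covid_Again)
```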

Another nice way to make a dataset is by using a tibble

Tibbles are part of the tidyverse, and they simplify the code somewhat. Notice that the command used here is tribble() (short for “transposed tibble”), which lets you type the data in row by row.

library(tidyverse)  # loads tibble (for tribble) and dplyr (for mutate and rename)

Covid_TR <- tribble(
  ~Countries,      ~Total_Cases, ~Total_Deaths,
  "United States", 24249722,     400810,
  "India",         10581837,     152556,
  "Brazil",        8511770,      210299,
  "Russia",        3574330,      65632,
  "UK",            3466849,      91470
)

tribble() is nice because the code itself is laid out like a spreadsheet.

Notice that the ~ marks the column (variable) names in the first line, and every line after that is one row of data.
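For comparison, the tibble package also has a tibble() function that builds the same kind of object column by column, much like data.frame(). The sketch below makes the dataset both ways and checks that they hold the same data. Covid_T is a made-up name for this illustration, and only the tibble package is loaded here to keep the example small.

```r
library(tibble)

# Column by column, like data.frame()
Covid_T <- tibble(
  Countries    = c("United States", "India", "Brazil", "Russia", "UK"),
  Total_Cases  = c(24249722, 10581837, 8511770, 3574330, 3466849),
  Total_Deaths = c(400810, 152556, 210299, 65632, 91470)
)

# Row by row, like a spreadsheet
Covid_TR <- tribble(
  ~Countries,      ~Total_Cases, ~Total_Deaths,
  "United States", 24249722,     400810,
  "India",         10581837,     152556,
  "Brazil",        8511770,      210299,
  "Russia",        3574330,      65632,
  "UK",            3466849,      91470
)

dim(Covid_T)   # 5 3

# Compare the two as plain data frames
all.equal(as.data.frame(Covid_T), as.data.frame(Covid_TR))
```

Which one you use is mostly a matter of taste: tribble() is easier to read for small datasets you type by hand, while tibble() is handier when your columns already exist as objects.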

Manipulate Data

Mortality rate here is total deaths divided by the total number of cases (epidemiologists usually call this ratio the case fatality rate). R can calculate it for every country at once, and you can store the result as a new object.

Mortality_Rate <- Total_Deaths / Total_Cases

Then we can combine all four variables to remake our Covid data.frame

Covid <- data.frame(Countries, Total_Cases, Total_Deaths, Mortality_Rate)
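You don’t have to rebuild the whole data frame to add a variable. As a sketch of an alternative, base R lets you attach a new column with the $ operator. The division works on the whole column at once, one country at a time. Mortality_Pct is a made-up name for this illustration; it simply shows the same rate as a rounded percentage.

```r
# Same objects as above, repeated so this block runs by itself
Countries    <- c("United States", "India", "Brazil", "Russia", "UK")
Total_Cases  <- c(24249722, 10581837, 8511770, 3574330, 3466849)
Total_Deaths <- c(400810, 152556, 210299, 65632, 91470)
Covid <- data.frame(Countries, Total_Cases, Total_Deaths)

# Add the new variable directly to the existing data frame
Covid$Mortality_Rate <- Covid$Total_Deaths / Covid$Total_Cases

# The same rate as a percentage, rounded to two decimal places
Covid$Mortality_Pct <- round(100 * Covid$Mortality_Rate, 2)
Covid$Mortality_Pct   # 1.65 1.44 2.47 1.84 2.64
```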

The tidyverse provides some other helpers here if we are using tibbles.

We can use mutate to add in the other variable based on a computation.

Covid_TR <- mutate(Covid_TR, Mortality_Rate = Total_Deaths / Total_Cases)

We can use rename to change the name of our variable. The new name goes on the left of the = and the old name goes on the right.

Covid_TR <- rename(Covid_TR, Mortality = Mortality_Rate)
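The two steps above can also be chained together with the pipe operator %>%, which passes the result of one step into the next. This is only a sketch of an alternative style, not a required one. It loads tibble and dplyr directly and repeats the data so the block runs on its own.

```r
library(tibble)
library(dplyr)

Covid_TR <- tribble(
  ~Countries,      ~Total_Cases, ~Total_Deaths,
  "United States", 24249722,     400810,
  "India",         10581837,     152556,
  "Brazil",        8511770,      210299,
  "Russia",        3574330,      65632,
  "UK",            3466849,      91470
)

# Read %>% as "and then": take Covid_TR, add the variable, then rename it
Covid_TR <- Covid_TR %>%
  mutate(Mortality_Rate = Total_Deaths / Total_Cases) %>%
  rename(Mortality = Mortality_Rate)

names(Covid_TR)   # "Countries" "Total_Cases" "Total_Deaths" "Mortality"
```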

You can practice this on your own.